# Load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio
Exploratory Data Analysis
Introduction
This section presents a detailed data analysis of job market trends in 2024, focusing on AI-driven changes, salary disparities, and employment trends across different regions and industries.
Data Import and Cleaning
Load dataset
df = pd.read_csv("lightcast_job_postings.csv")
# Display dataset summary
df.info()
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72498 entries, 0 to 72497
Columns: 118 entries, id to naics_2022_6_name
dtypes: float64(38), object(80)
memory usage: 65.3+ MB
| | duplicates | duration | modeled_duration | company | min_edu_levels | max_edu_levels | employment_type | min_years_experience | max_years_experience | salary | ... | lot_occupation_group | lot_v6_specialized_occupation | lot_v6_occupation | lot_v6_occupation_group | lot_v6_career_area | naics_2022_2 | naics_2022_3 | naics_2022_4 | naics_2022_5 | naics_2022_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 72476.000000 | 45182.000000 | 53208.000000 | 7.245400e+04 | 72454.000000 | 16315.000000 | 72454.000000 | 49352.000000 | 8430.000000 | 30808.000000 | ... | 72454.000000 | 7.245400e+04 | 72454.000000 | 72454.000000 | 72454.000000 | 72454.000000 | 72454.000000 | 72454.000000 | 72454.000000 | 72454.000000 |
| mean | 1.081627 | 22.322695 | 19.737615 | 3.702704e+07 | 31.482527 | 2.833834 | 1.058768 | 5.486444 | 3.773903 | 117953.755031 | ... | 2239.204475 | 2.239318e+07 | 223931.694096 | 2239.204475 | 22.281158 | 58.352555 | 587.864590 | 5883.121995 | 58834.317125 | 588345.683937 |
| std | 2.807512 | 14.359085 | 12.963769 | 3.015089e+07 | 44.747433 | 0.584028 | 0.286997 | 3.322241 | 2.576739 | 45133.878359 | ... | 285.424309 | 2.854275e+06 | 28542.747473 | 285.424309 | 2.854360 | 18.626415 | 186.259064 | 1864.093904 | 18642.971892 | 186431.744508 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 15860.000000 | ... | 1111.000000 | 1.111101e+07 | 111110.000000 | 1111.000000 | 11.000000 | 11.000000 | 111.000000 | 1111.000000 | 11115.000000 | 111150.000000 |
| 25% | 0.000000 | 11.000000 | 10.000000 | 6.505993e+06 | 2.000000 | 3.000000 | 1.000000 | 3.000000 | 2.000000 | 84928.500000 | ... | 2310.000000 | 2.310101e+07 | 231010.000000 | 2310.000000 | 23.000000 | 52.000000 | 522.000000 | 5223.000000 | 52232.000000 | 522320.000000 |
| 50% | 0.000000 | 18.000000 | 16.000000 | 3.761516e+07 | 2.000000 | 3.000000 | 1.000000 | 5.000000 | 3.000000 | 116300.000000 | ... | 2311.000000 | 2.311131e+07 | 231113.000000 | 2311.000000 | 23.000000 | 54.000000 | 541.000000 | 5415.000000 | 54151.000000 | 541519.000000 |
| 75% | 1.000000 | 32.000000 | 28.000000 | 4.330689e+07 | 99.000000 | 3.000000 | 1.000000 | 8.000000 | 5.000000 | 145600.000000 | ... | 2311.000000 | 2.311131e+07 | 231113.000000 | 2311.000000 | 23.000000 | 56.000000 | 561.000000 | 5614.000000 | 56149.000000 | 561499.000000 |
| max | 100.000000 | 59.000000 | 59.000000 | 1.082365e+08 | 99.000000 | 4.000000 | 3.000000 | 15.000000 | 14.000000 | 500000.000000 | ... | 2712.000000 | 2.712111e+07 | 271211.000000 | 2712.000000 | 27.000000 | 99.000000 | 999.000000 | 9999.000000 | 99999.000000 | 999999.000000 |
8 rows × 38 columns
Data Cleaning & Preprocessing
Drop Unnecessary Columns
Which columns should be dropped, and why?
The columns selected for removal are considered redundant because they either provide duplicate information, are unnecessary for analysis, or have more detailed equivalents in the dataset. For example, "ID" serves as a unique identifier but is often not needed for analysis, while "URL" and "ACTIVE_URLS" contain job posting links that are useful externally but not critical for data processing. Similarly, "LAST_UPDATED_TIMESTAMP" is dropped because "LAST_UPDATED_DATE" already provides update information in a more readable format. The "DUPLICATES" column, which likely flags repeated entries, is also removed since duplicates can be handled separately.
Additionally, industry and occupational classification columns like "NAICS2" to "NAICS6" and "SOC_2", "SOC_3", "SOC_5" are removed because these represent different levels of classification, and more relevant or updated versions (e.g., "NAICS_2022_2" to "NAICS_2022_6") are already present in the dataset. Removing these redundant columns helps streamline the dataset, making it more efficient to analyze without losing valuable information.
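Before dropping the legacy classification columns, it can be worth confirming that they really duplicate the newer ones. A minimal sketch on made-up values (the real check would run on `df` itself, and the toy frame below is purely illustrative):

```python
import pandas as pd

# Toy frame standing in for the dataset; the column names follow the text,
# but the values here are invented for illustration.
toy = pd.DataFrame({
    "naics2":       [52, 54, 56, 52, 99],
    "naics_2022_2": [52, 54, 56, 52, 99],
})

# Share of rows where the legacy and 2022 codes agree; a high agreement
# rate supports treating the legacy column as redundant before dropping it.
overlap = (toy["naics2"] == toy["naics_2022_2"]).mean()
print(f"naics2 vs naics_2022_2 agreement: {overlap:.0%}")
```

If the agreement rate were low, the legacy columns would carry independent information and should be kept.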
columns_to_drop = [
    "id", "duplicates", "last_updated_timestamp",
    "naics2", "naics3", "naics4", "naics5", "naics6",
    "soc_2", "soc_3", "soc_5"
]
df = df.drop(columns=[col for col in columns_to_drop if col in df.columns])
print("Columns after dropping:", df.columns)
df.head()
Columns after dropping: Index(['last_updated_date', 'posted', 'expired', 'duration', 'title_raw',
'body', 'modeled_expired', 'modeled_duration', 'company',
'company_name',
...
'naics_2022_2', 'naics_2022_2_name', 'naics_2022_3',
'naics_2022_3_name', 'naics_2022_4', 'naics_2022_4_name',
'naics_2022_5', 'naics_2022_5_name', 'naics_2022_6',
'naics_2022_6_name'],
dtype='object', length=107)
| | last_updated_date | posted | expired | duration | title_raw | body | modeled_expired | modeled_duration | company | company_name | ... | naics_2022_2 | naics_2022_2_name | naics_2022_3 | naics_2022_3_name | naics_2022_4 | naics_2022_4_name | naics_2022_5 | naics_2022_5_name | naics_2022_6 | naics_2022_6_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9/6/2024 | 6/2/2024 | 6/8/2024 | 6.0 | Enterprise Analyst (II-III) | 31-May-2024\n\nEnterprise Analyst (II-III)\n\n... | False | 6.0 | 894731.0 | Murphy USA | ... | 44.0 | Retail Trade | 441.0 | Motor Vehicle and Parts Dealers | 4413.0 | Automotive Parts, Accessories, and Tire Retailers | 44133.0 | Automotive Parts and Accessories Retailers | 441330.0 | Automotive Parts and Accessories Retailers |
| 1 | 8/2/2024 | 6/2/2024 | 8/1/2024 | NaN | Oracle Consultant - Reports (3592) | Oracle Consultant - Reports (3592)\n\nat SMX i... | False | NaN | 133098.0 | Smx Corporation Limited | ... | 56.0 | Administrative and Support and Waste Managemen... | 561.0 | Administrative and Support Services | 5613.0 | Employment Services | 56132.0 | Temporary Help Services | 561320.0 | Temporary Help Services |
| 2 | 9/6/2024 | 6/2/2024 | 7/7/2024 | 35.0 | Data Analyst | Taking care of people is at the heart of every... | False | 8.0 | 39063746.0 | Sedgwick | ... | 52.0 | Finance and Insurance | 524.0 | Insurance Carriers and Related Activities | 5242.0 | Agencies, Brokerages, and Other Insurance Rela... | 52429.0 | Other Insurance Related Activities | 524291.0 | Claims Adjusting |
| 3 | 9/6/2024 | 6/2/2024 | 7/20/2024 | 48.0 | Sr. Lead Data Mgmt. Analyst - SAS Product Owner | About this role:\n\nWells Fargo is looking for... | False | 10.0 | 37615159.0 | Wells Fargo | ... | 52.0 | Finance and Insurance | 522.0 | Credit Intermediation and Related Activities | 5221.0 | Depository Credit Intermediation | 52211.0 | Commercial Banking | 522110.0 | Commercial Banking |
| 4 | 6/19/2024 | 6/2/2024 | 6/17/2024 | 15.0 | Comisiones de $1000 - $3000 por semana... Comi... | Comisiones de $1000 - $3000 por semana... Comi... | False | 15.0 | 0.0 | Unclassified | ... | 99.0 | Unclassified Industry | 999.0 | Unclassified Industry | 9999.0 | Unclassified Industry | 99999.0 | Unclassified Industry | 999999.0 | Unclassified Industry |
5 rows × 107 columns
Handle Missing Values
How should missing values be handled?
Missing values should be handled strategically based on their impact on analysis. First, visualizing missing data with a heatmap helps identify patterns and assess severity. Columns with more than 50% missing values are dropped to avoid unreliable or incomplete data. For numerical fields like "Salary", filling missing values with the median ensures the data remains representative without being skewed by outliers. Categorical fields like "Industry" are filled with "Unknown" to maintain completeness while preserving interpretability. This approach balances data retention and accuracy, ensuring meaningful analysis without introducing bias.
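The strategy above can be illustrated on a small synthetic frame (the column names mirror the text; the values are made up, not drawn from the dataset):

```python
import pandas as pd
import numpy as np

# Illustrative frame: one numeric column, one categorical, one mostly empty.
toy = pd.DataFrame({
    "salary":         [90000.0, np.nan, 120000.0, np.nan, 150000.0],
    "industry":       ["Finance", None, "Tech", "Tech", None],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Drop columns with more than 50% missing values.
toy = toy.dropna(thresh=len(toy) * 0.5, axis=1)

# Median-fill numeric fields and label missing categories, using plain
# assignment rather than inplace chained methods (safe under pandas 3.0).
toy["salary"] = toy["salary"].fillna(toy["salary"].median())
toy["industry"] = toy["industry"].fillna("Unknown")

print(toy)
```

The `mostly_missing` column falls below the 50% threshold and is dropped, while `salary` and `industry` are retained and filled.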
import missingno as msno
import matplotlib.pyplot as plt
# Visualize missing data
# Identify columns with >10% missing values
missing_threshold = 0.1
missing_cols = df.columns[df.isnull().mean() > missing_threshold]
# Filter DataFrame
df_missing = df[missing_cols]
# Generate heatmap (msno.heatmap creates its own figure, so figsize is passed directly)
msno.heatmap(df_missing, figsize=(12, 6))
plt.title("Missing Values Heatmap (Filtered)")
plt.show()
# Drop columns with >50% missing values
df.dropna(thresh=len(df) * 0.5, axis=1, inplace=True)
# Fill missing values
# Restore the raw salary column: it exceeds the 50% missing threshold,
# but is central to the analysis, so it is re-added from the original file
df_original = pd.read_csv("lightcast_job_postings.csv")
df["salary"] = df_original["salary"]
salary_col = "salary" if "salary" in df.columns else None
if salary_col:
    # Plain assignment avoids the chained-assignment FutureWarning
    df[salary_col] = df[salary_col].fillna(df[salary_col].median())
else:
    print("Warning: No salary-related column found!")
df["naics6_name"] = df["naics6_name"].fillna("Unknown")

Remove Duplicates
To ensure each job is counted only once, we remove duplicates based on job title, company, location, and posting date.
print("Existing columns in DataFrame:", df.columns.tolist()) # Display actual column names
# Convert column names to lowercase for case-insensitive matching
df.columns = df.columns.str.lower()
columns_to_check = ["title", "company", "location", "posted"]
existing_columns = [col for col in columns_to_check if col in df.columns]
if not existing_columns:
    raise ValueError("None of the specified columns exist in the DataFrame. Check column names!")
print("Before removing duplicates:")
print(df[existing_columns].head())
df = df.drop_duplicates(subset=existing_columns, keep="first")
print("\nAfter removing duplicates:")
print(df[existing_columns].head())
print("\nDuplicates removed based on:", existing_columns)
Existing columns in DataFrame: ['last_updated_date', 'posted', 'expired', 'duration', 'title_raw', 'body', 'modeled_expired', 'modeled_duration', 'company', 'company_name', 'company_raw', 'company_is_staffing', 'education_levels', 'education_levels_name', 'min_edu_levels', 'min_edu_levels_name', 'employment_type', 'employment_type_name', 'min_years_experience', 'is_internship', 'remote_type', 'remote_type_name', 'location', 'city', 'city_name', 'county', 'county_name', 'msa', 'msa_name', 'state', 'state_name', 'county_outgoing', 'county_name_outgoing', 'county_incoming', 'county_name_incoming', 'msa_outgoing', 'msa_name_outgoing', 'msa_incoming', 'msa_name_incoming', 'naics2_name', 'naics3_name', 'naics4_name', 'naics5_name', 'naics6_name', 'title', 'title_name', 'title_clean', 'certifications', 'certifications_name', 'onet', 'onet_name', 'onet_2019', 'onet_2019_name', 'cip6', 'cip6_name', 'cip4', 'cip4_name', 'cip2', 'cip2_name', 'soc_2021_2', 'soc_2021_2_name', 'soc_2021_3', 'soc_2021_3_name', 'soc_2021_4', 'soc_2021_4_name', 'soc_2021_5', 'soc_2021_5_name', 'lot_career_area', 'lot_career_area_name', 'lot_occupation', 'lot_occupation_name', 'lot_specialized_occupation', 'lot_specialized_occupation_name', 'lot_occupation_group', 'lot_occupation_group_name', 'lot_v6_specialized_occupation', 'lot_v6_specialized_occupation_name', 'lot_v6_occupation', 'lot_v6_occupation_name', 'lot_v6_occupation_group', 'lot_v6_occupation_group_name', 'lot_v6_career_area', 'lot_v6_career_area_name', 'soc_2_name', 'soc_3_name', 'soc_4', 'soc_4_name', 'soc_5_name', 'naics_2022_2', 'naics_2022_2_name', 'naics_2022_3', 'naics_2022_3_name', 'naics_2022_4', 'naics_2022_4_name', 'naics_2022_5', 'naics_2022_5_name', 'naics_2022_6', 'naics_2022_6_name', 'salary']
Before removing duplicates:
title company \
0 ET29C073C03D1F86B4 894731.0
1 ET21DDA63780A7DC09 133098.0
2 ET3037E0C947A02404 39063746.0
3 ET2114E0404BA30075 37615159.0
4 ET0000000000000000 0.0
location posted
0 {\n "lat": 33.20763,\n "lon": -92.6662674\n} 6/2/2024
1 {\n "lat": 44.3106241,\n "lon": -69.7794897\n} 6/2/2024
2 {\n "lat": 32.7766642,\n "lon": -96.7969879\n} 6/2/2024
3 {\n "lat": 33.4483771,\n "lon": -112.0740373\n} 6/2/2024
4 {\n "lat": 37.6392595,\n "lon": -120.9970014\n} 6/2/2024
After removing duplicates:
title company \
0 ET29C073C03D1F86B4 894731.0
1 ET21DDA63780A7DC09 133098.0
2 ET3037E0C947A02404 39063746.0
3 ET2114E0404BA30075 37615159.0
4 ET0000000000000000 0.0
location posted
0 {\n "lat": 33.20763,\n "lon": -92.6662674\n} 6/2/2024
1 {\n "lat": 44.3106241,\n "lon": -69.7794897\n} 6/2/2024
2 {\n "lat": 32.7766642,\n "lon": -96.7969879\n} 6/2/2024
3 {\n "lat": 33.4483771,\n "lon": -112.0740373\n} 6/2/2024
4 {\n "lat": 37.6392595,\n "lon": -120.9970014\n} 6/2/2024
Duplicates removed based on: ['title', 'company', 'location', 'posted']
Exploratory Data Analysis (EDA)
EDA helps uncover patterns in job postings and salaries across industries. These insights assist job seekers in making informed career decisions.
Job Postings by Industry
Why this visualization?
This bar chart helps identify which industries have the highest number of job postings. It provides insights into industry demand, helping job seekers target sectors with more opportunities.
import plotly.express as px
import plotly.io as pio
# Set Plotly renderer for Quarto or Jupyter
pio.renderers.default = "notebook"
# Get top 20 industries by job postings
top_n = 20
industry_counts = df["naics6_name"].value_counts().nlargest(top_n).reset_index()
industry_counts.columns = ["Industry", "Count"]
# Create horizontal bar chart with a taller y-axis
fig = px.bar(
    industry_counts,
    x="Industry",
    y="Count",
    title=f"Top {top_n} Job Postings by Industry (NAICS6)",
    labels={"Industry": "Industry", "Count": "Number of Job Postings"}
)
# Extend y-axis and increase figure height
fig.update_layout(
    xaxis_title="Industry",
    yaxis_title="Number of Job Postings",
    yaxis=dict(range=[0, industry_counts["Count"].max() * 1.2]),  # Extend y-axis
    height=1000  # Increase figure height for better spacing
)
fig.show()
Insights
Unclassified Industry has the highest count overall, which may indicate miscategorized postings or emerging sectors not yet classified. Among classified industries, Custom Computer Programming Services, Administrative Management, and Employment Placement Agencies lead, indicating strong demand in the tech, consulting, and staffing sectors. Computer Systems Design and Commercial Banking also show significant job availability, reflecting growth in IT and finance, while Health Insurance and Educational Services show moderate availability.
Salary Distribution by Industry
Why this visualization?
This box plot is used to analyze salary distribution across the top 20 industries. It helps compare median salaries, salary variability, and outliers, which is crucial for understanding income potential in different fields.
import plotly.express as px
# Get top 20 industries by job postings
top_n = 20
top_industries = df["naics6_name"].value_counts().nlargest(top_n).index
# Filter dataset for top industries
df_filtered = df[df["naics6_name"].isin(top_industries)]
# Create the box plot with an extended y-axis
fig = px.box(
    df_filtered,
    x="naics6_name",
    y="salary",
    title=f"Salary Distribution in Top {top_n} Industries",
    labels={"naics6_name": "Industry", "salary": "Salary ($)"},
    points="all"  # Show all points, including outliers
)
# Extend the y-axis
fig.update_layout(
    xaxis_title="Industry",
    yaxis_title="Salary ($)",
    yaxis=dict(range=[0, df_filtered["salary"].max() * 1.2]),  # 20% above max salary
    height=1000  # Increase figure height for better visibility
)
fig.show()
Insights
Commercial Banking and Tech-related industries show wide salary ranges, indicating opportunities for growth. Temporary Help Services has the lowest pay, reflecting short-term or contract roles. Tech and finance roles offer both high salaries and significant growth potential.
Remote vs. On-Site Jobs
Why this visualization?
This pie chart compares the distribution of remote, hybrid, and on-site jobs, showing workplace flexibility trends. It helps job seekers understand how common remote opportunities are in the current job market.
fig = px.pie(df, names="remote_type_name", title="Remote vs. On-Site Jobs")
fig.show()
Insights
The majority of jobs (~78.3%) are not explicitly classified, which may indicate missing or unspecified remote-work details in postings. Only 17% of jobs are fully remote, suggesting that while remote work exists, it is not yet dominant in most industries. Hybrid-remote jobs (3.11%) are emerging but remain a small share, indicating a slow transition toward flexible work models. Only 1.58% of postings are explicitly labeled "Not Remote," but on-site work likely remains the norm given the large unclassified share. Remote opportunities exist but are limited, so job seekers should target specific industries or roles for remote work.
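The percentages quoted above come from a simple normalized frequency count. A sketch on synthetic data (the counts are invented to mimic the shares discussed, and the "[None]" label for unclassified postings is an assumption about how the dataset encodes them):

```python
import pandas as pd

# Invented postings whose proportions echo the pie chart above.
toy = pd.DataFrame({
    "remote_type_name": (
        ["[None]"] * 78 + ["Remote"] * 17 + ["Hybrid Remote"] * 3 + ["Not Remote"] * 2
    )
})

# value_counts(normalize=True) yields each category's share of all postings.
shares = toy["remote_type_name"].value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Running the same line against `df["remote_type_name"]` reproduces the figures behind the pie chart.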
Geographic Variation of Job Postings
Why this visualization?
This map visualizes the number of job postings by U.S. state, offering a clear look at where opportunities are most concentrated geographically. It helps job seekers understand which states have the highest job demand.
state_counts = df["state_name"].value_counts().reset_index()
state_counts.columns = ["State", "Job Postings"]
us_state_abbrev = {
'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR',
'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE',
'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID',
'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS',
'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS',
'Missouri': 'MO', 'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV',
'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM',
'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND',
'Ohio': 'OH', 'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA',
'Rhode Island': 'RI', 'South Carolina': 'SC', 'South Dakota': 'SD',
'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV',
'Wisconsin': 'WI', 'Wyoming': 'WY'
}
# Add abbreviation column
state_counts["State Abbrev"] = state_counts["State"].map(us_state_abbrev)
# Create a choropleth map
fig = px.choropleth(
    state_counts,
    locations="State Abbrev",
    locationmode="USA-states",
    color="Job Postings",
    color_continuous_scale="Blues",
    scope="usa",
    title="Job Postings by U.S. State"
)
fig.update_layout(
    geo=dict(bgcolor='rgba(0,0,0,0)'),
    height=600
)
fig.show()
Insights: Geographic Patterns in Job Postings (2024–2025)
Texas and California lead in job postings, reflecting strong economies and diverse industries (tech, energy, entertainment). Florida, New York, and Illinois also show significant job demand, driven by strong finance, healthcare, and logistics sectors. Southeastern and Midwestern states such as North Carolina, Georgia, and Ohio provide moderate opportunities, often paired with a lower cost of living. Lower-posting states such as Wyoming, Montana, and Alaska may reflect a narrower regional economic focus or smaller labor markets.
This distribution suggests job seekers may benefit from targeting high-posting states or considering relocation for greater opportunity.
Job Postings Over Time
Why this visualization?
This time series plot shows how job demand has changed over time. It’s useful to spot trends, seasonal hiring spikes, or drops (e.g., holidays, recession).
df["posted_date"] = pd.to_datetime(df["posted"], errors="coerce")
# Group by month
monthly_postings = df.groupby(df["posted_date"].dt.to_period("M")).size().reset_index(name="Job Postings")
monthly_postings["Month"] = monthly_postings["posted_date"].dt.to_timestamp()
fig = px.line(
    monthly_postings,
    x="Month",
    y="Job Postings",
    title="Job Postings Over Time",
    labels={"Month": "Date", "Job Postings": "Number of Job Postings"}
)
fig.update_layout(height=500)
fig.show()
Insights
The line chart depicting job postings over time reveals a clear temporal trend in hiring activity throughout mid-2024. From early May to the end of June, there is a noticeable decline in the number of job postings, dropping from around 14,000 to approximately 12,200. This dip could be attributed to seasonal factors, such as the end of academic semesters, mid-year budget reviews, or general summer slowdowns in corporate recruitment cycles. However, beginning in early July, the job market shows a sharp rebound, with postings rising significantly through August, eventually stabilizing at around 14,700. This resurgence likely reflects renewed hiring efforts following budget resets or organizational planning periods. The uptick in late summer also aligns with typical Q3 hiring trends, where companies ramp up recruitment ahead of the final quarter. For job seekers, this pattern suggests that while opportunities may temporarily slow in early summer, late July through August presents a strong window for applications as employers actively seek talent.
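Before reading seasonal stories into a monthly series, it can help to smooth it with a short rolling mean. A sketch on a synthetic series of posting dates (values invented to mimic a mid-year dip and rebound; the real analysis would reuse `df["posted_date"]`):

```python
import pandas as pd

# Synthetic posting dates: a May peak, a June dip, then a summer rebound.
dates = pd.to_datetime(
    ["2024-05-10"] * 5 + ["2024-06-15"] * 3 + ["2024-07-20"] * 6 + ["2024-08-05"] * 8
)
postings = pd.Series(1, index=dates)

# Same grouping as above: count postings per calendar month.
monthly = postings.groupby(postings.index.to_period("M")).size()

# A 2-month rolling mean damps single-month noise before trend reading.
smoothed = monthly.rolling(window=2, min_periods=1).mean()
print(monthly)
print(smoothed)
```

On the real data, comparing the raw and smoothed curves makes it easier to distinguish a genuine rebound from month-to-month jitter.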
Top Job Titles by Frequency
Why this visualization
This graph shows which roles are in highest demand — great for resume optimization and understanding what skills employers seek most often.
top_titles = df["title_name"].value_counts().nlargest(20).reset_index()
top_titles.columns = ["Job Title", "Count"]
fig = px.bar(
    top_titles,
    x="Count",
    y="Job Title",
    orientation="h",
    title="Top 20 Job Titles by Frequency",
    labels={"Count": "Number of Postings", "Job Title": "Job Title"},
    height=600
)
fig.update_layout(yaxis=dict(autorange="reversed")) # highest at top
fig.show()
Insights
This bar chart provides a broader view of the top 20 job titles by frequency, highlighting dominant roles within the job market. “Data Analysts” clearly lead in demand, with over 8,000 postings — reaffirming the central role of data professionals in today’s workforce.
Other high-demand roles include “Business Intelligence Analysts,” “Enterprise Architects,” and “Oracle Cloud HCM Consultants,” which suggests a strong emphasis on both data strategy and cloud-based enterprise solutions. The appearance of niche titles like “SAP Consultants,” “Data Governance Analysts,” and “Data Quality Analysts” reflects organizations’ growing need for specialized expertise in maintaining and managing data infrastructure.
Overall, this visualization reinforces the importance of analytics, enterprise systems, and data architecture in the current job market.
Salary Distribution by Job Type
Why this visualization?
This visualization compares earning potential across work modes, which matters when evaluating the financial trade-offs between remote, hybrid, and on-site roles.
df_salary = df[df["salary"].notnull() & (df["salary"] < 300000)]
# Box plot: salary by remote_type_name
fig = px.box(
    df_salary,
    x="remote_type_name",
    y="salary",
    title="Salary Distribution by Job Type (Remote vs. On-Site)",
    labels={"remote_type_name": "Job Type", "salary": "Salary ($)"},
    points="all"
)
fig.update_layout(height=600)
fig.show()
Insights
This box plot presents the salary distribution across different job types, comparing remote, hybrid, on-site, and unclassified roles. A key takeaway is that salaries for remote and hybrid roles tend to be higher and more consistent than those for on-site positions. Remote jobs, in particular, show a tight interquartile range clustered around $110K–$125K, indicating a strong market demand and willingness to pay for location-flexible roles.
In contrast, on-site (“Not Remote”) jobs show a broader distribution, with a wider range of salaries and a lower median. This suggests more variability in pay, possibly due to a wider mix of roles (from entry-level to senior) or differing cost-of-living adjustments by region.
Overall, the visualization reinforces a trend seen across industries: remote and hybrid roles are not only desirable for flexibility but also competitive in compensation. Job seekers aiming for high-paying opportunities may benefit from targeting remote-friendly employers, especially in tech and analytics fields.
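The median and spread comparisons read off the box plot can also be computed directly with a groupby. A sketch on invented salaries, chosen only to echo the pattern described above (a tight, high remote cluster versus a wide on-site range):

```python
import pandas as pd

# Toy salaries by work arrangement; the labels follow the chart, the
# numbers are illustrative and not taken from the dataset.
toy = pd.DataFrame({
    "remote_type_name": ["Remote", "Remote", "Remote",
                         "Not Remote", "Not Remote", "Not Remote", "Not Remote"],
    "salary": [110000, 118000, 125000, 60000, 95000, 130000, 70000],
})

# Median and standard deviation per job type give the box plot's story
# in numbers: remote pay is higher and tighter, on-site pay wider.
summary = toy.groupby("remote_type_name")["salary"].agg(["median", "std"]).round(0)
print(summary)
```

Running the same aggregation on `df_salary` would put exact figures behind the qualitative reading of the plot.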
Conclusion
The exploratory data analysis (EDA) provided a comprehensive overview of the 2024 job market landscape by examining job postings, salaries, and structural patterns across various dimensions.
Top Industries by Postings A bar chart of the most active industries revealed strong hiring demand in sectors like technology, consulting, and staffing. This highlights which fields are driving employment opportunities.
Salary Distribution by Industry A box plot compared salary ranges across top industries, showing clear disparities in compensation. Sectors like finance and tech offered both high pay and broad salary variability, while others showed lower, more consistent pay.
Remote vs. On-Site Jobs A pie chart illustrated that the majority of job postings lacked explicit remote classification, but among those that did, remote jobs were more prevalent than on-site roles. This reflects the growing demand for flexible work arrangements.
Job Postings by U.S. State A choropleth map identified geographic disparities in job availability, with states like Texas, California, and Florida leading in job volume. This helps job seekers target locations with strong hiring activity.
Job Postings Over Time A time series line chart revealed a seasonal trend: a dip in postings during early summer followed by a strong recovery in July and August. This suggests mid-year slowdowns and hiring rebounds tied to business cycles.
Top Job Titles by Frequency A horizontal bar chart showed that “Data Analyst” and related roles dominate the job market. The presence of various analyst titles reflects high demand for data-driven decision-making skills across industries.
Salary Distribution by Job Type A box plot comparing remote, hybrid, and on-site roles indicated that remote and hybrid jobs offer competitive — often higher — median salaries. This supports the idea that flexibility and compensation can go hand in hand.
Together, these seven visualizations provide a well-rounded understanding of the job market’s current state — from what roles are most in demand, where they are, how much they pay, and how work modes affect earning potential. This foundation supports deeper analysis and career strategy planning for job seekers and workforce analysts alike.